SEPSIS

Dataset and Attribute Information

The data for this Exploratory Data Analysis (EDA) was retrieved from a previous study, Baghela (2022).

RNA-Seq and clinical data were gathered from five hospitals (“cohorts”): 348 patients from four emergency rooms (ER) and one intensive care unit (ICU), plus 44 healthy controls. 82 samples came from the ICU; the rest came from the ER. The data is distributed over five different datasets, two of which are of particular interest; the others mainly contain IDs of genes and gene products. As mentioned, the data has been used in a previous study, in which the researchers focused on extracting differentially expressed genes (DEGs) related to sepsis severity and mortality. Additionally, these researchers established endotypes based on immune-related genes, identifying five endotypes that were mechanistically and biologically distinct. We aim to perform a similar experiment in this project, but now focusing entirely on mitochondria-related genes. Our approach differs in that we extract only those genes, rendering our gene set considerably smaller than that of the previous study, which used all genes. When establishing mitochondria-driven endotypes, we plan on extracting them with two essential variables (“class variables”): sepsis_severity and mortality. sepsis_severity consists of three levels based on SOFA scores: ‘High’ (SOFA of 5 or above), ‘Intermediate’ (SOFA of 2 or above and below 5), and ‘Low’ (SOFA below 2). mortality is based on the outcome of a patient’s admission: ‘deceased’, ‘survived’, or ‘unknown’. We plan on using these two variables to extract DEGs (more on that in de_and_pathway_analysis.Rmd). The main focus of this project is to determine whether mitochondria-related genes have any “power” in predicting sepsis severity, not necessarily to come to a grandiose conclusion. Rather, we want to discover whether there is a way forward for more personalized medicine by focusing on mitochondria-related genes.
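
As a hypothetical illustration (not project code), the severity cut-offs described above can be encoded directly:

```r
# Map SOFA scores to the three severity levels described in the text:
# High (SOFA >= 5), Intermediate (2 <= SOFA < 5), Low (SOFA < 2).
sofa_to_severity <- function(sofa) {
  cut(sofa,
      breaks = c(-Inf, 2, 5, Inf),
      labels = c("Low", "Intermediate", "High"),
      right = FALSE) # intervals are closed on the left: [2, 5) etc.
}

sofa_to_severity(c(0, 3, 7))
```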

The data consists of the following datasets:

We will only give a quick overview of the metadata and raw counts features, since these contain most of the essential data. Most of the data has already been ‘cleaned’. However, since we are working with a subset of the raw counts dataset, we will probably discover a different distribution and different outliers. Additionally, the previous researchers identified outliers but did not have enough evidence to support removing them. Also, many NA values are still present in the dataset. Some of these are introduced by the differences between cohorts; for instance, some metadata is entirely ICU-specific, which introduces NAs for ER samples. Therefore, it is still crucial to do an in-depth data analysis.

The raw count dataset has a relatively simple structure; column names are sample IDs and identical to the feature sample_identifier in the metadata. Additionally, Ensembl IDs are present as row names. We will retrieve gene names from Ensembl and use these as row names instead. Raw counts are not normalized and require a more intensive approach than metadata.
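
A sketch of how that relabeling could work, assuming a mapping data frame with columns ensembl_gene_id and hgnc_symbol (e.g., as returned by biomaRt::getBM); relabel_counts is a hypothetical helper, not part of the project code:

```r
# Replace Ensembl IDs in the row names with gene symbols where a unique,
# non-empty symbol is available; unmapped rows keep their Ensembl ID.
relabel_counts <- function(counts, mapping) {
  idx <- match(rownames(counts), mapping$ensembl_gene_id)
  symbols <- mapping$hgnc_symbol[idx]
  keep <- !is.na(symbols) & symbols != "" & !duplicated(symbols)
  rownames(counts)[keep] <- symbols[keep]
  counts
}
```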

We first start by looking at the metadata after importing all relevant libraries.

Preparation

We start by loading some libraries – not all at once, because of function name conflicts (namely between dplyr and biomaRt) – and reading all relevant datasets. The reordered function (from helper_functions.R) ensures sample identifiers are in the same order in the metadata and count data.

library(affy)
library(caret)
library(corrplot)
library(dendextend)
library(glue)
library(kableExtra)
library(knitr)
library(reshape2)
library(tidyverse, warn.conflicts = F)
library(visdat)

# Import some essential helper functions
source('scripts/helper_functions.R')

mitocondria.genes <- read.table('data/source/GOCC_MITOCHONDRION.v2023.1.Hs.gmt') %>%
  dplyr::select(-c(1, 2)) %>% # Remove first two columns
  t() %>% 
  as.data.frame() %>% 
  dplyr::rename(gene_id = 1)
meta.data <- read.csv('data/source/GSE185263_meta_data_N392.csv')
raw.counts <- read.csv('data/source/GSE185263_raw_counts.csv') %>%
  column_to_rownames('ensembl_gene_id') %>%
  reordered(meta.data, .)
ids <- read.csv('data/source/UniqueID.csv', header = T, sep = ';')
gene.products <- read.table('data/source/Mitochondrion_GeneProduct.txt', sep = '\t') %>% 
  dplyr::rename(gene_id = 1, desc = 2, gene_product = 3)
sra <- read.table('data/source/SraRunTable.txt', sep = ',', header = T)
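
The reordered helper itself lives in helper_functions.R and is not shown; a minimal sketch of what such a helper might do (reorder_counts is a hypothetical stand-in):

```r
# Align the columns of a count table with the sample order in the metadata,
# so that column i of the counts corresponds to row i of the metadata.
reorder_counts <- function(meta, counts, id_col = "sample_identifier") {
  stopifnot(all(meta[[id_col]] %in% colnames(counts)))
  counts[, meta[[id_col]], drop = FALSE]
}
```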

To give a quick overview, we use the dim function on every dataset. As shown in Table 3, we have a total of 392 patients, and the mitochondria gene set (1,669 genes) is much smaller than the raw counts dataset (58,389 genes); roughly 97.14% of genes are lost. The metadata contains all clinical information and therefore has many features; the other datasets have relatively few.
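
The quoted reduction follows directly from the two gene counts:

```r
# Reduction from the full transcriptome (58,389 genes) to the
# mitochondrial gene set (1,669 genes), as quoted in the text.
loss_pct <- round((1 - 1669 / 58389) * 100, 2)
loss_pct # roughly 97.14
```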

# Dimensions of all datasets
dims <- data.frame(
  Meta = c(dim(meta.data)),
  Raw.Counts = c(dim(raw.counts)),
  Gene.Products = c(dim(gene.products)),
  SRA = c(dim(sra))
)

row.names(dims) <- c("Instances", "Features")

# Take a peek
kable(dims, format = "html", booktabs = TRUE, caption = "Dimensions of datasets") %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
Dimensions of datasets

            Meta  Raw.Counts  Gene.Products  SRA
Instances    392       58389           1382  392
Features     268         392              3   34

Meta Dataset

The meta-dataset contains all patients’ clinical information, such as gender, age, and information regarding their ER/ICU visit(s). Since we distinguish between these samples and healthy controls, there are many missing values (NAs). In this section, we will highlight the missing values and devise a plan to deal with them. That could mean removing some instances or attributes, or filling values in; for example, a healthy control could get a 0 inserted in an attribute regarding ER records. Filling in missing values is done in collaboration with experts. In addition, we look at the distribution of numeric attributes, some of which might need to be normalized.
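
As a hypothetical illustration of such a targeted fill (the feature and group names below are made up for the example):

```r
# Set a feature to 0 for one patient group only, leaving all other
# missing values untouched.
fill_zero_for_group <- function(df, feature, group_col, group_val) {
  sel <- df[[group_col]] == group_val & is.na(df[[feature]])
  df[[feature]][sel] <- 0
  df
}
```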

Base statistics

We look closer at the data types and the number of NAs per feature using the glimpse function. The metadata set contains so many features that displaying them all here would not give an adequate picture; therefore, a CSV is available in the misc/ directory. The SRA dataset mainly contains information about the RNA-Seq data: collection information, creation date, etc. We do not need this information and will not touch this dataset again. ids contains an identifier mapping to NCBI, and gene.products contains the gene names and their products. For the metadata set, we only rename one very long feature.

glimpse(sra)
## Rows: 392
## Columns: 34
## $ Run                      <chr> "SRR16193822", "SRR16193823", "SRR16193825", …
## $ Age                      <int> 55, 22, 88, 47, 65, 75, 18, 57, 21, 57, 24, 1…
## $ Assay.Type               <chr> "RNA-Seq", "RNA-Seq", "RNA-Seq", "RNA-Seq", "…
## $ AvgSpotLen               <int> 100, 100, 100, 100, 100, 100, 100, 100, 100, …
## $ Bases                    <dbl> 2070022600, 1872169000, 1855266400, 107131190…
## $ BioProject               <chr> "PRJNA768419", "PRJNA768419", "PRJNA768419", …
## $ BioSample                <chr> "SAMN22043762", "SAMN22043763", "SAMN22043764…
## $ Bytes                    <dbl> 917046299, 784875882, 795907335, 457044135, 5…
## $ Center.Name              <chr> "MICROBIOLOGY AND IMMUNOLOGY, UNIVERSITY OF B…
## $ collection_location      <chr> "colombia", "colombia", "colombia", "colombia…
## $ collection_site          <chr> "Emergency Room", "Emergency Room", "Emergenc…
## $ Consent                  <chr> "public", "public", "public", "public", "publ…
## $ DATASTORE.filetype       <chr> "run.zq,sra,fastq", "sra,run.zq,fastq", "run.…
## $ DATASTORE.provider       <chr> "s3,gs,ncbi", "gs,s3,ncbi", "ncbi,gs,s3", "nc…
## $ DATASTORE.region         <chr> "s3.us-east-1,gs.US,ncbi.public", "ncbi.publi…
## $ disease_state            <chr> "sepsis", "sepsis", "sepsis", "sepsis", "seps…
## $ Experiment               <chr> "SRX12478539", "SRX12478538", "SRX12478537", …
## $ in_hospital_mortality    <chr> "Survived", "Survived", "Died", "Survived", "…
## $ Instrument               <chr> "Illumina HiSeq 2500", "Illumina HiSeq 2500",…
## $ Library.Name             <chr> "GSM5609007", "GSM5609006", "GSM5609005", "GS…
## $ LibraryLayout            <chr> "SINGLE", "SINGLE", "SINGLE", "SINGLE", "SING…
## $ LibrarySelection         <chr> "cDNA", "cDNA", "cDNA", "cDNA", "cDNA", "cDNA…
## $ LibrarySource            <chr> "TRANSCRIPTOMIC", "TRANSCRIPTOMIC", "TRANSCRI…
## $ Organism                 <chr> "Homo sapiens", "Homo sapiens", "Homo sapiens…
## $ Platform                 <chr> "ILLUMINA", "ILLUMINA", "ILLUMINA", "ILLUMINA…
## $ ReleaseDate              <chr> "2022-01-10T00:00:00Z", "2022-01-10T00:00:00Z…
## $ create_date              <chr> "2021-10-04T15:04:00Z", "2021-10-04T15:04:00Z…
## $ version                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Sample.Name              <chr> "GSM5609007", "GSM5609006", "GSM5609005", "GS…
## $ sex                      <chr> "female", "female", "female", "male", "female…
## $ sofa_24h_post_admisssion <int> 2, 2, 4, 2, 0, 2, 2, 2, 0, 2, 0, 6, 3, 0, 5, …
## $ source_name              <chr> "Whole Blood", "Whole Blood", "Whole Blood", …
## $ SRA.Study                <chr> "SRP339949", "SRP339949", "SRP339949", "SRP33…
## $ Tissue                   <chr> "Whole Blood", "Whole Blood", "Whole Blood", …
glimpse(ids)
## Rows: 392
## Columns: 2
## $ LibraryName <chr> "GSM5608946", "GSM5608947", "GSM5608948", "GSM5608949", "G…
## $ SampleName  <chr> "sepcol001", "sepcol002", "sepcol003", "sepcol004", "sepco…
glimpse(gene.products)
## Rows: 1,382
## Columns: 3
## $ gene_id      <chr> "GCK", "UQCRQ", "RARS2", "COX10", "TIMM23", "BRAF", "AGTP…
## $ desc         <chr> "Hexokinase-4", "Cytochrome b-c1 complex subunit 8", "Pro…
## $ gene_product <chr> "", "", "RARSL", "", "TIM23", "BRAF1|RAFB1", "CCP1|KIAA10…
meta.data <- meta.data %>%
  rename(icu_Anti.COVID_last_date = icu_Anti.COVID.Tx...Immune.Modulating.Tx.data.censored.on.date.of....last.sample.sent.)

Essential features

Now, we zoom in on the metadata and how some important features are distributed over the ER and ICU cohorts. These features are related to SOFA scores, lactate measurements (an important indicator of mitochondrial dysfunction), white blood cell measurements, and other essentials. We calculate the mean, standard error, and availability (as a raw count and as a percentage) for every column named in ess_vars. table_1 holds this information for all cohorts, and the function calculate_metrics computes these metrics separately for ER and ICU for every essential column. These calculations give us a better understanding of how the data is distributed. The results are depicted in Tables 4 and 5.
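
The standard error reported in these tables is the sample standard deviation divided by the square root of the number of non-missing observations; as a standalone sketch:

```r
# Standard error of the mean over the non-missing values of x.
standard_error <- function(x) {
  sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
}
```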

ess_vars <- c("sepsis_severity", "mortality", 
              "age", "gender", "first_at_ed_sofa", "icu_sofa", 
              "worst_within_72_sofa", "worst_within_72_lactate",
              "worst_within_72_neutrophil_count", "outcome_hospital_stay_days", 
              "patient_location", "icu_outcome_icu_stay_days", 
              "worst_within_72_total_cell_count", "icu_ANC")

table_1 <- meta.data %>%
  select(all_of(ess_vars)) %>%
  mutate(patient_location = 
           ifelse(patient_location == 'ward', 'icu', patient_location)) %>%
  mutate(age = as.numeric(age),
         first_at_ed_sofa = as.numeric(first_at_ed_sofa),
         icu_sofa = as.numeric(icu_sofa),
         worst_within_72_sofa = as.numeric(worst_within_72_sofa),
         worst_within_72_lactate = as.numeric(worst_within_72_lactate),
         worst_within_72_neutrophil_count = 
           as.numeric(worst_within_72_neutrophil_count),
         outcome_hospital_stay_days = 
           as.numeric(outcome_hospital_stay_days)) %>%
  mutate(gender = ifelse(gender == 0, 'male', 'female'))

icu <- table_1 %>% 
  filter(patient_location == 'icu') %>% 
  select(-patient_location)

er <-  table_1 %>% 
  filter(patient_location == 'er') %>% 
  select(-patient_location)

calculate_metrics <- function(df) {
  # empty dataset to which we can append to
  table_one <- data.frame(matrix(ncol = 5, nrow = 0))
  names(table_one) <- c("var", "total_av", "mean", "sd", "av")

  for (column in names(df)) {
    col <- df[[column]]

    if (is.numeric(col)) {
      mean_value <- round(mean(col, na.rm = TRUE), 2)
      # standard error = sd / sqrt(number of non-missing values)
      se_value <- round(sd(col, na.rm = TRUE) /
                          sqrt(sum(!is.na(col))), 2)
      
    } else {
      # categorical features get no mean or standard error
      mean_value <- NA
      se_value <- NA
    }
    
    # availability as a percentage and as a raw count
    total_available <- round(sum(!is.na(col)) / length(col) * 100, 2)
    av <- sum(!is.na(col))
    new_row <- c(column, total_available, mean_value, se_value, av)
    
    # bind to "table_one"
    table_one <- rbind(table_one, new_row)
  }
  return(table_one)
}

er_calc <- calculate_metrics(er)
names(er_calc) <- c("var", "total_av", "mean", "se", "av")
icu_calc <- calculate_metrics(icu)
names(icu_calc) <- c("var", "total_av", "mean", "se", "av")

er_calc %>%
  filter(av != 0) %>%
  kable(format = 'html', table.attr = "class='table table-bordered table-hover'") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
var                               total_av   mean    se   av
sepsis_severity                     100        NA    NA  266
mortality                            99.62     NA    NA  265
age                                 100      56.05  1.26  266
gender                              100        NA    NA  266
first_at_ed_sofa                    100       1.97  0.12  266
worst_within_72_sofa                100       1.31  0.14  266
worst_within_72_lactate              18.05    2.24  0.25   48
worst_within_72_neutrophil_count     56.02    6.59  0.41  149
outcome_hospital_stay_days           97.74    7.53  0.54  260
worst_within_72_total_cell_count     80.45    9.39  0.39  214
icu_calc %>%
  filter(av != 0) %>% # remove columns with no available information
  kable(format = 'html', table.attr = "class='table table-bordered table-hover'") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
var                        total_av   mean    se  av
sepsis_severity              100        NA    NA  82
age                           98.78   61.41  1.7  81
gender                        98.78     NA    NA  81
first_at_ed_sofa             100       7.34  0.55  82
icu_sofa                     100       7.34  0.55  82
icu_outcome_icu_stay_days    100      12.04  0.97  82
icu_ANC                       98.78    9.48  0.67  81

Data Types and Missing values

In the next section, we look at the data distribution of the metadata. We are interested in the percentage of missing values per attribute and the data type of each attribute. For this, we use the library visdat. Since we have many attributes, we subdivide them into four plots via a function that generates a plot for every 67 attributes. We do this twice: once for data types and once for NAs.

loop.vector <- c(1, 67, 134, 201)
missing.values.plot <- lapply(loop.vector, function(x) vis_dat(meta.data[x:(x+67)]) +
         labs(title = sprintf("Data type per feature, feature %s-%s", x, x+67), x = "Feature") + 
             theme(axis.text.x = element_text(angle = 90, hjust = 0, size = 7.5)))
plot(missing.values.plot[[1]])
Data types and NAs for features 1-67. Data types include character (red), integer (green), logical (blue), numeric (purple), and NA (grey).


plot(missing.values.plot[[2]])
Data types and NAs for features 67-134. Data types include character (red), integer (green), logical (blue), numeric (purple), and NA (grey).


plot(missing.values.plot[[3]])
Data types and NAs for features 134-201. Data types include character (red), integer (green), logical (blue), numeric (purple), and NA (grey).


plot(missing.values.plot[[4]])
Data types and NAs for features 201-268. Data types include character (red), integer (green), logical (blue), numeric (purple), and NA (grey).


Many NAs are present in the metadata. Basic information such as age and gender is always filled in, but cohort-specific features contain a large share of the missing values. Besides, our metadata contains diverse data types. Next, let us zoom in on the missing values per feature via the function vis_miss, reusing the loop.vector variable.

plots.missing.values <- lapply(loop.vector, function(x) 
  vis_miss(meta.data[x:(x+67)], cluster = T) +
    labs(title = sprintf("Missing values per feature, feature %s-%s", x, x+67), 
         x = "Feature") +
    theme(axis.text.x = element_text(angle = 90, size = 6.5)))
plot(plots.missing.values[[1]])
First set of features (1-68) displaying the NAs per feature. In addition, a percentage of total missing values is given per feature. 40.2% of observations were NAs and 59.8% filled in.

plot(plots.missing.values[[2]])
Second set of features (67-134) displaying the NAs per feature. In addition, a percentage of total missing values is given per feature. 57.2% of observations were NAs and 42.8% filled in.

plot(plots.missing.values[[3]])
Third set of features (134-201) displaying the NAs per feature. In addition, a percentage of total missing values is given per feature. 51.4% of observations were NAs and 48.6% filled in.

plot(plots.missing.values[[4]])
Fourth set of features (201-268) displaying the NAs per feature. In addition, a percentage of total missing values is given per feature. 80.4% of observations were NAs and 19.6% filled in.

As shown in the figures above, many missing values are present in group-specific features. For example, features about patients located in the ICU contain no data for ER patients and healthy controls. Since we are dealing with clinical data, it is vital to understand the context behind these missing values. We talked to Nicole Erler, an expert in the field of biostatistics (Erler, n.d.). According to Erler, “just” filling in a zero is not the way to go either. Our data is too challenging to impute quickly; doing so properly would be a sufficient half-year project on its own and probably not helpful. We therefore leave the missing values as they are for now. We could return later and still handle some of the more significant features.

To highlight the number of missing values even better, we construct a table with the number of missing values for every attribute, both as a count and as a percentage of all instances. This allows us to zoom in on those group-specific features. We do this by counting all the NAs per patient_location (ER, ICU, etc.) as a percentage.

table.missing.values <- meta.data %>%
  group_by(patient_location) %>%
  summarize(across(everything(), 
                   ~round(sum(is.na(.))/n()*100, 2))) %>%
  column_to_rownames('patient_location') %>%
  t()
# only keep features that exhibit 100% NAs over all cohorts
columns_delete <- table.missing.values %>% as.data.frame() %>%
  filter(across(everything(), ~ . == 100))

kable(columns_delete, format = "html", booktabs = TRUE, 
      caption = "Features that exhibit 100% NAs, across all groups") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
Features that exhibit 100% NAs, across all groups
                                      er  healthy_control  icu  ward
cdna_quantity                        100              100  100   100
sequencing_date                      100              100  100   100
worst_within_72_heart_map_target     100              100  100   100
icu_paired_sample                    100              100  100   100
icu_tech                             100              100  100   100
icu_cdna_prep_date                   100              100  100   100
icu_cdna_quantity                    100              100  100   100
icu_sequencing_date                  100              100  100   100
icu_column_kit                       100              100  100   100
icu_notes_sequencing_prep            100              100  100   100
icu_hospital_name                    100              100  100   100
icu_recruitment_date                 100              100  100   100
icu_triage_time                      100              100  100   100
icu_time_first_seen_by_ed_physician  100              100  100   100
icu_ctas_score                       100              100  100   100
meta.data <- meta.data %>%
  select(-all_of(rownames(columns_delete))) %>%
  mutate(patient_location = 
           ifelse(patient_location == 'healthy_control', 'hc', patient_location))

We visualize the results in the table above and see that a few features are empty for all cohorts. We can safely delete these features.

We use the nearZeroVar function to identify near-zero-variance variables. As with glimpse, the result for the metadata set is available as a CSV under misc/ (only for internal readers). Some samples in the count data exhibit little variance; it is too early to remove them, but it is a valuable indicator. We will look at these samples in the section regarding the counts dataset.

#l <- nearZeroVar(meta.data, names=T) %>% write.csv("near_zero_meta.csv")
nearZeroVar(raw.counts, names=T)
##  [1] "sepcol024"  "sepcol027"  "sepnet1347" "sepnet1351" "sepnet1385"
##  [6] "sepnet1415" "sepwes015"  "sepwes021"  "sepwes026"  "sepwes032" 
## [11] "sepwes034"  "sepwes044"  "sepwes052"  "sepwes058"  "sepwes070" 
## [16] "sepwes080"  "sepwes103"  "hcwes025"   "hcwes027"   "sepcv006T0"
nearZeroVar(sra, names=T)
##  [1] "Assay.Type"       "BioProject"       "Center.Name"      "Consent"         
##  [5] "Instrument"       "LibraryLayout"    "LibrarySelection" "LibrarySource"   
##  [9] "Organism"         "Platform"         "ReleaseDate"      "version"         
## [13] "source_name"      "SRA.Study"        "Tissue"
nearZeroVar(gene.products, names=T)
## character(0)
nearZeroVar(ids, names=T)
## character(0)
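
caret’s nearZeroVar flags predictors whose most frequent value dominates and that have few distinct values relative to the sample size; a base-R sketch of the same idea (near_zero is a simplified stand-in, with the default cut-offs caret documents):

```r
# Flag columns whose most frequent value dominates (high frequency ratio)
# and that have few unique values relative to the number of observations.
near_zero <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  flags <- vapply(df, function(x) {
    tab <- sort(table(x), decreasing = TRUE)
    freq_ratio <- if (length(tab) > 1) tab[[1]] / tab[[2]] else Inf
    pct_unique <- 100 * length(unique(x)) / length(x)
    freq_ratio > freq_cut && pct_unique < unique_cut
  }, logical(1))
  names(df)[flags]
}
```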

Distribution

In this section, we look at the distribution of the meta-dataset. Since it contains many attributes of the character type, we use barplots to discover relationships between various features. Our primary interest is visualizing the class variables, namely mortality, sepsis_severity, endotype_name, and sample_location; the first two are the most essential to look at. To accomplish this, we made a counter function that counts the number of instances for combinations of two features and summarizes them in a barplot. The !!rlang::sym(x) construct can be confusing: rlang::sym(x) takes the string x and converts it into a symbol, after which !! unquotes it.
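
As a minimal, hypothetical illustration of this tidy-eval pattern (count_by is not part of the project code):

```r
library(dplyr)

# Group by a column given as a string: the string is turned into a symbol
# with rlang::sym() and unquoted with !! inside group_by().
count_by <- function(df, col) {
  df %>%
    group_by(!!rlang::sym(col)) %>%
    summarize(n = n(), .groups = "drop")
}
```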

counter <- function(dataset, target, features) {
  agg <- dataset %>%
    select(all_of(features), !!sym(target)) %>%
    pivot_longer(cols = all_of(features), names_to = "feature", values_to = 'valuation') %>%
    group_by(!!sym(target), feature, valuation) %>%
    summarize(count = n(), .groups = 'drop')
  
 ggplot(agg, aes(x = valuation, y = count, fill = !!sym(target))) +
    geom_col() +
    facet_wrap(~ feature, scales = "free_x") +
    labs(x = "Value", y = "Count",
         title = glue("Various features distribution based on {target}"))
} 

In the figure below, we see that the Toronto cohort is indeed the ICU cohort, and most samples are from the ER. Most people, regardless of admission, left the hospital alive; only a small portion died, which means the mortality variable may be limited in scope for DEG extraction. This variable also contains a lot of NAs, and the entire ICU cohort has no information on mortality. For sepsis_severity, most High-severity cases come from the ICU cohort, which is also the smallest. This class imbalance may be problematic, especially for the ER cohorts.

counter(meta.data, 'sample_location', c('sepsis_severity', 
                                        'patient_location', 
                                        'mortality'))
Distributions of `sample_location` with `mortality`, `patient_location`, and `sepsis_severity`. Sample location is an important variable since it shows the differences between the cohorts.


The figure below takes the severity variable as the counter. We see that most samples are in the age groups 51-60 through 71-80.

counter(meta.data, 'sepsis_severity', 
        c('age_group', 'patient_location', 'mortality', 'endotype_name'))
Distributions of `sepsis_severity` with `age_group`, `patient_location`, `mortality`, and `endotype_name` (established in the Baquir study). Severity is spread out over the age groups. The number of samples also increases in older populations. In contrast to the ER cohorts, the ICU cohort has the largest share of High-severity cases in percentage.

In the figure below, age_group is used as the counter. Low severity is more pronounced in younger populations, and almost no one under the age of forty died of sepsis.

counter(meta.data, 'age_group', c('mortality', 'sepsis_severity'))
Distributions of `age_group` with `mortality` and `sepsis_severity`. Low severity exhibits a higher share of younger populations, whereas older populations are more abundant in High and Intermediate subgroups. This phenomenon can also be seen in the mortality variable; younger age groups are less represented in the subcategory deceased.


Next, we convert all attributes that are supposed to be numeric to the numeric type, since we are interested in the distribution of numeric values to determine whether we need to normalize/standardize these attributes. Attributes consisting of 0/1 will be converted to FALSE and TRUE, respectively. Thereafter, as a temporary measure, we select all numeric values, fill any NA in with 0 (until they are filled correctly later), and compare the distributions. We plot 20 features at a time because of the large number of features; the total number of numeric features is 116. We use the function plot_distribution for this. It splits the data into chunks of a size specified by the user and shows the distribution of each variable, using patient_location to color the distributions. In practice, this function can be used for all data types.

meta.data <- meta.data %>%
  mutate_if(~ all(. %in% c(0, 1, NA)), ~ as.logical(.))

numeric_data <- meta.data %>%
  select_if(is.numeric)

plot_distribution <- function(dataset, col_length, color_var) {
  p <- list()
  # calculate how many parts we need to split the data in
  indices <- ceiling(ncol(dataset)/col_length)
  
  # which variable is to be distinct upon (coloring in)
  color_var <- as.factor(color_var)

  for (i in 1:indices) {
    
    # calculate the start and end index
    start_col <- (i-1)*col_length+1
    end_col <- i*col_length
    
    if(end_col > ncol(dataset)) {
      end_col <- ncol(dataset)
    }
    
    p[[i]] <- dataset[, start_col:end_col] %>%
      # if we have character types, convert to factor and then to numeric
      mutate_if(is.character, as.factor) %>%
      mutate_if(is.factor, as.numeric) %>%
      bind_cols(color_var) %>%
# put the color_column as the last column (it might not be in 'dataset' param)
      rename(color_column = last_col()) %>%
      # longer format since we want to count
      pivot_longer(-color_column,
                   names_to = "features", values_to = "values") %>%
      ggplot(aes(x = values, fill = color_column)) +
      geom_histogram(bins = 15) +
      # for every feature, make a "small" plot
      facet_wrap(~ features, scales = "free") +
      theme(strip.text = element_text(size = 7))
  }
  
  return(p)
}

with.nas <- plot_distribution(numeric_data, 20, meta.data$patient_location)
# turn the NAs into 0 for comparison
numeric_data <- numeric_data %>%
  mutate_all(~ifelse(is.na(.), 0, .))
without.nas <- plot_distribution(numeric_data, 20, meta.data$patient_location)

By comparing the figures above, we see the influence of the NAs in action. As highlighted, the number of NA values is cohort-dependent and needs special attention when using specific (later established as significant) features. However, we can still use most of the features depicted here in a per-cohort fashion; for example, we make a separate dataset when analyzing the ER cohort.

In the next section, we define plot_corrrplot, which makes a correlation plot based on user-specified parameters. The idea is to use all data for the correlation analysis, but to do it per sample_location (Australia, Netherlands, etc.) to mitigate the effect of the NA distribution somewhat. However, NAs are still present and cannot be fully avoided. One disadvantage of this method is that some essential variable-variable comparisons cannot be calculated due to missing values. We only extract correlations above a certain threshold (0.7 by default, via the parameter sig_thres). group simply refers to the group the function is applied to.

plot_corrrplot <- function(data, group=NULL, sig_thres = 0.7, 
                           method = 'spearman',
                           use='everything', pre=T) {
  
  if (pre) {
    # only for datasets with these attributes
    data <- data %>% as.data.frame() %>%
    select(-c(sample_identifier, GEO_identifier, sample_identifier_raw))
  }

  correlation_df <- data %>% as.data.frame() %>%
    mutate_if(is.character, as.factor) %>% # from character to factor
    mutate_if(is.factor, as.numeric) %>% # from factor to int
    cor(method = method, use = use) %>%
    as.data.frame() %>%
    # Put the variable (rowname) to a column
    rownames_to_column('variable') %>%
    # longer format - compare each variable against all others
    pivot_longer(-variable, names_to = 'comp2', values_to = 'score') %>% 
    drop_na() %>% # drop nas and diagonal
    filter(abs(score) > sig_thres & score != 1) %>%
    arrange(desc(abs(score))) %>%
    # dataframe to matrix for corrplot function
    acast(variable~comp2, value.var="score")
  
  corrplot(correlation_df, method ='ellipse', na.label = " ", 
           tl.cex = 0.7, is.corr = F, type = 'lower', 
           diag = F, main = glue("Correlation plot for {group}"))
}

corr_plots <- meta.data %>%
  group_by(sample_location) %>%
  group_walk(~ plot_corrrplot(.x, .y, 0.8))

It is hard to explain every correlation here, but in most of the figures above we can conclude that patient_location correlates well with various sepsis SOFA features (e.g., sepsis-3). Also, the class variable sepsis_severity has high correlations with other SOFA features (e.g., first_at_ed_sofa and icu_sofa). We won’t be deleting any of these features.

As a final zoom-in on the metadata set, we look at the SOFA scores per sample_location in the figure below. For these boxplots, we removed the healthy controls entirely and selected four features that correlate highly with sepsis_severity. Of all the ER cohorts, Australia has the highest overall SOFA scores (though not for qSOFA). As expected, the ICU cohort has the largest range of SOFA scores (between 15+ and ~2).

meta.data %>%
  filter(patient_location != 'hc') %>%
  select(icu_sofa, first_at_ed_sofa, sample_location, 
         at_ed_qsofa, worst_within_72_sofa) %>%
  pivot_longer(-sample_location) %>%
  ggplot(aes(x = name, y = value, color = sample_location)) +
  geom_boxplot() +
  labs(x = 'SOFA metric', y = 'Value', 
       title = 'SOFA score boxplots per sample location')
Boxplots for every cohort in various SOFA-related variables. The ER cohort Australia exhibited the highest SOFA scores of all the ER cohorts.

Raw Counts

In this section, we will first look at the entire raw counts dataset and zoom in on the mitochondria-related genes later. It is essential to check for outliers and other anomalies here, since this dataset covers all of the variance, which also impacts the mitochondria-related genes. First, we use DESeq2, a widely used library in a standard DE analysis workflow. We use the ~ 1 design to specify no covariates - we are not comparing different conditions yet - while still being able to use some of DESeq2’s tools.

We use the whole population here, including the healthy controls. We perform PCA to get a basic understanding of the gene distribution at a sample level.
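Throughout this notebook we rely on our own perform_pca helper from helper_functions.R, which is not shown here. A minimal sketch of what it presumably does, inferred purely from how its return value is used below ($pca, and $eigenvalues with pva and p_cum columns) - the actual implementation may differ:

```r
# Hypothetical sketch of perform_pca() from helper_functions.R (assumption):
perform_pca <- function(mat, scale = FALSE) {
  pca <- prcomp(t(mat), scale. = scale)       # transpose: samples become rows
  pva <- pca$sdev^2 / sum(pca$sdev^2) * 100   # percentage of variance per PC
  eigenvalues <- data.frame(
    PC = factor(colnames(pca$x), levels = colnames(pca$x)),
    pva = pva,
    p_cum = cumsum(pva)                       # cumulative percentage
  )
  list(pca = pca, eigenvalues = eigenvalues)
}
```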

library(corrr)
library(reshape2)
library(pheatmap)
library(PoiClaClu)
library(grid)
library(gridExtra)
library(DESeq2)
library(factoextra)

( ddsMat <- DESeqDataSetFromMatrix(countData = raw.counts, 
                                   colData = meta.data, design = ~ 1) ) 
## class: DESeqDataSet 
## dim: 58389 392 
## metadata(1): version
## assays(1): counts
## rownames(58389): ENSG00000000003 ENSG00000000005 ... ENSG00000285509
##   ENSG00000285513
## rowData names(0):
## colnames(392): sepcol001 sepcol002 ... hchlD218 hchlD219
## colData names(253): sample_identifier GEO_identifier ... mortality
##   day_to_death
# normalize with VST
rld.dds.hc <- vst(ddsMat)

# scale=F due to low-expressed genes
pca_hc <- perform_pca(assay(rld.dds.hc), scale=F)

prepare_pca <- function(meta, pca_df) {
  pca_scores <- pca_df$pca$x %>%
    as_tibble(rownames = 'sample_identifier') %>%
    # merge with metadata
    full_join(x = ., y = meta, by = 'sample_identifier') %>%
    # mutate mortality to unknown if NA
    mutate(mortality = ifelse(is.na(mortality), 'unknown', mortality))
  
  # give the variance in percentage its own column (for plot_pca specifically)
  pca_scores <- pca_scores %>%
    mutate(var = pca_df$eigenvalues$pva)
  
  return(pca_scores)
}

plot_pca <- function(dataset, x_var='PC1', y_var='PC2', fill=NULL, 
                     color=NULL, size=NULL, shape=NULL, meta=NULL) {
  
  if (!is.null(meta)) {
    # prepare a dataset
    dataset <- prepare_pca(meta, dataset)
  }
  
  var <- dataset$var
  
  ggplot(dataset, aes_string(x = x_var, y = y_var, 
                             color = color, fill = fill, 
                             size = size, shape = shape)) +
  geom_point() +
  labs(title = glue("PCA regarding {color}"), 
       x = glue("PC1 ({round(var[1], 2)}%)"), 
       y = glue("PC2 ({round(var[2], 2)}%)")) +
  theme(plot.title = element_text(size = 10))
}

In the figure below, we see that the healthy controls form their own cluster. Even if only slightly, samples from different sample locations differ, even among the healthy controls (Netherlands and Australia). The ER has a large range, but based on sample location, there might be some bias here; we will look further into it. The ICU has a smaller range but overlaps with the ER a lot.

plot_pca(pca_hc, color='patient_location', shape='sample_location', meta = meta.data)
PCA plot based on sample location, with the healthy controls included. As we can see, the healthy controls form their own cluster, whereas the ER, ward, and ICU cohorts are much more spread out. The biggest cohort, ER, especially shows a lot of overlap between its sample locations; the ICU also overlaps with the ER. In addition, samples from the same location cluster together well, stressing the need for a batch correction.

We will now filter out the healthy controls and focus entirely on septic patients from now on.

meta.sepsis <- meta.data %>%
  filter(!patient_location %in% c('healthy_control', 'hc'))

sepsis.counts <- raw.counts %>%
  select(-starts_with('hc')) %>%
  reordered(meta.sepsis, .)

# outer () prints a summary!
( ddsMat <- DESeqDataSetFromMatrix(countData = sepsis.counts, 
                                   colData = meta.sepsis, design = ~ 1) ) 
## class: DESeqDataSet 
## dim: 58389 348 
## metadata(1): version
## assays(1): counts
## rownames(58389): ENSG00000000003 ENSG00000000005 ... ENSG00000285509
##   ENSG00000285513
## rowData names(0):
## colnames(348): sepcol001 sepcol002 ... sepcv702W1 sepcv703W1
## colData names(253): sample_identifier GEO_identifier ... mortality
##   day_to_death
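The reordered() function used above comes from helper_functions.R and is not shown in this notebook. A minimal sketch under the assumption that it simply aligns the count columns with the metadata row order via sample_identifier:

```r
# Hypothetical sketch of reordered() from helper_functions.R (assumption):
# put the count columns in the same order as the metadata rows
reordered <- function(meta, counts) {
  counts[, meta$sample_identifier]
}
```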

In addition to PCA, we plot the density of the sepsis-related count data in its unnormalized form and under three different standardization and normalization efforts - namely, min-max, log2, and VST. The latter is designed specifically for RNA-Seq data: it stabilizes the variance across the range of mean values in count data. We use blind=TRUE (the default) since we do not consider differences between samples yet, and assay() to extract the normalized counts. The figures below cover the normalization techniques. We plot with plotDensity() from the affy library.

min.max.normalization <- function(x) {
    return((x - min(x)) / (max(x) - min(x)))
}

counts.log <- log2(sepsis.counts + 0.01) # + 0.01 to avoid taking the log of zero
counts.norm <- min.max.normalization(sepsis.counts)
rld.dds <- vst(ddsMat)
rld <- assay(rld.dds)
plotDensity(sepsis.counts, xlab = 'Unnormalized counts',
            main = 'Expression distribution (unnormalized)')
Distribution of unnormalized RNA-Seq counts from septic patients.

The figure above confirms the need for normalization; the counts are spread over far too wide a range.

plotDensity(counts.log, lty=c(1:ncol(counts.log)), 
            xlab = 'Log2(sepsis.counts)', 
            main = 'Expression distribution (log2 normalized)')
Density plot on log-normalized counts.

The log2 transformation already improves the distribution considerably.

plotDensity(counts.norm, lty=c(1:ncol(counts.norm)), 
            xlab = 'sepsis.counts', 
            main = 'Expression distribution (min-max standardized)')
Density plot on min-max-standardized counts.

This looks much the same as the unnormalized distribution: it keeps the skew and remains sensitive to outliers.

plotDensity(rld, lty=c(1:ncol(rld)), xlab = 'sepsis.counts', 
            main = 'Expression distribution (VST standardized)')
Density plot on VST-normalized counts.

VST is one of the preferred normalization methods for RNA-Seq data and is useful for downstream analysis. The distribution range is still large, but that is normal in RNA-Seq. We therefore use this normalization technique for the rest of the project.

Next, we experimented with different parameters to filter out low-expressed genes. Typically, a gene is considered uninformative when its count falls below ten in all but a handful of samples. We tried count thresholds of 1-15 (in low.exp), required in at least x samples (in sample), and toggled whether to normalize the counts before applying the threshold (in norm). This covered 300 different parameter settings - too many to show in a table, and the resulting gene-set sizes vary greatly. We settled on keeping genes with counts of at least 10 in at least ten samples. We did not normalize before filtering; doing so lost too many genes, and it is standard practice not to normalize before filtering out low-expressed genes. After that, we compared the gene sets with and without low-expressed genes. After filtering, the variance explained increased slightly for both PC1 and PC2, which is a good sign.

# to get estimateSizeFactors to use `normalized` parameter
dds <- DESeq(ddsMat)

different_params <- data.frame(outcome = numeric(), used_comb = character())
combinations <- crossing(low.exp = 1:15, sample = 1:10, norm = c(TRUE, FALSE))

for(i in 1:nrow(combinations)) {
  combination <- combinations[i, ]
  
  # use the norm flag from the parameter grid (it was ignored before)
  keep <- rowSums(counts(dds, normalized = combination$norm) >= combination$low.exp) >= combination$sample
  outcome <- ddsMat[keep, ] %>% dim() # only retain the amount of counts
  # make a reference to this combination
  used_comb <- glue("{combination[1]}-{combination[2]}-{combination[3]}")

  different_params <- bind_rows(different_params, 
    data.frame(outcome = outcome[1], used_comb = used_comb))
}

different_params <- different_params %>%
  separate(used_comb, into = c("low_exp", "sample", "norm"), sep = "-")

head(different_params)
keep <- rowSums(counts(dds) >= 10) >= 10
dds <- dds[keep, ]
rld.two <- assay(vst(dds))

# with and without low-expressed genes
# (rld is still the unfiltered matrix here; rld.two is the filtered one)
with_low <- perform_pca(rld, scale = F) %>% 
  plot_pca(color = "patient_location", meta = meta.sepsis)
without_low <- perform_pca(rld.two, scale = F) %>% 
  plot_pca(color = "patient_location", meta = meta.sepsis)
grid.arrange(with_low, without_low, ncol=2,
             top = "PCA plots: Influence of low-expressed genes")
PCA plot with (a) and without (b) low-expressed genes. Variance increases without the low-expressed genes.

In the table below, we can see how many genes we “lost”. We went from 58,389 to 17,825 genes with sufficient expression - a reduction of roughly 69%.

data.frame(
  before = dim(rld),
  after = dim(rld.two)
) %>%
  kable(format = 'html', booktabs = TRUE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
before after
58389 17825
348 348
# overwrite rld for simplicity
rld <- rld.two

Further dimension reduction

We have used PCA before; we do this with our own functions from helper_functions.R. Note that, since we removed all the low-expressed genes, we can now safely use the scale=TRUE parameter. We show it only once here, but it is standard usage in the rest of the project. We do, however, lose some of the variance because we scaled.

pca_res <- perform_pca(rld, scale = T)

The following three plots visualize the total variance explained per PC. As the scree plot shows, the first two PCs each contain more than 15% of the total variance; after that, the variance explained decreases to less than 5% per additional PC. The second figure shows this cumulatively: including 89 PCs explains 80% of the total variance. Both figures are summarized in the Pareto plot, where the black dotted line traces the cumulative variance per PC and stops around 89 PCs, at 80% of the total variance. Used as a dimension reduction effort, this takes us from >17,000 genes (features) to 89 features. With the explain_variance function we calculate the number of PCs needed to explain 80% of the variance.
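explain_variance also lives in helper_functions.R and is not shown here. A minimal sketch of what it presumably does, assuming it takes the cumulative-percentage vector and a threshold:

```r
# Hypothetical sketch of explain_variance() (actual version in helper_functions.R):
# return the number of PCs needed to reach `threshold` percent cumulative variance
explain_variance <- function(p_cum, threshold) {
  which(p_cum >= threshold)[1]
}
```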

ycord <- explain_variance(pca_res$eigenvalues$p_cum, 80)

ggplot(pca_res$eigenvalues[1:10, ], aes(x = PC, y = pva)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Scree plot", x = "PC", y = "Variance in percentage")
Scree plot with the first 10 principal components.

ggplot(pca_res$eigenvalues[1:ycord, ], aes(x = PC, y = p_cum)) +
  geom_col(color = "black") +
  geom_hline(yintercept = 80,
             linetype = 'dashed', color = 'red') +
  labs(title = "Variance explained by principal components",
       x = 'PC',
       y = "Cumulative percentage variance explained") +
  theme(axis.text.x = element_text(angle = 90,size = 7))
Barplot with cumulative variance explained in percentage per principal component. Red line indicates 80% of total variance explained.

# Pareto chart
ggplot(pca_res$eigenvalues[1:ycord,], aes(x = PC)) +
  geom_col(aes(y = pva)) +
  geom_line(aes(y = p_cum, group = 1)) + # group = 1: treat all data as one 
  geom_point(aes(y = p_cum)) +
  geom_hline(yintercept = 80, linetype = 'dashed', color = 'red') +
  labs(x = "Principal Component", y = "Fraction variance explained") +
  theme(axis.text.x = element_text(angle = 45, size = 7))
Pareto chart of the first 89 principal components. Each principal component expresses a certain percentage of total variance explained. The cumulative percentage of variance explained is summarized in the dotted line, which stops when it reaches the red line, indicating 80% of total variance explained.

Next, we cluster on multiple class variables. The figure here is most important, as we cluster on sepsis_severity and mortality. It shows that the spread of all sample groups seems random; no distinction between groups can be seen. This is concerning: without any obvious grouping on these class variables, a supervised machine learning algorithm may struggle to predict them accurately.

prep_pca <- prepare_pca(meta.sepsis, pca_res)
plot_pca(prep_pca, color = 'sepsis_severity', shape = 'mortality')
Clustering on `sepsis_severity` with groups High in red, Intermediate in green, and Low in blue; and `mortality` with groups 'unknown' (former NA; a temporary measure), deceased as triangle, and survived as square. The clustered groups seem random, and no major distinction can be made.

As observed in the metadata section, we expected some effects from the various sample locations. Now we visualize these effects through PCA (along with other features). If we look at the first subfigure in the figure below, we see that sample_location clusters well. This is the batch effect: clusters form because samples come from the same batch. We will correct for this with a function in helper_functions.R that uses sva’s ComBat on the sequence_month_year variable. This way, we have one function that both applies DESeq2’s VST and performs batch correction.

In the next section, we perform the batch correction and thereafter do PCA again. If we look at the same subfigure, we see that now the sample_location feature does not cluster so well anymore.
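The normalize() helper comes from helper_functions.R and is not shown in this notebook. A minimal sketch of what it presumably does, assuming VST first and ComBat on sequence_month_year afterwards (the actual helper may differ in details):

```r
# Hypothetical sketch of normalize() from helper_functions.R (assumption):
library(DESeq2)
library(sva)

normalize <- function(counts, meta) {
  norm.counts <- vst(as.matrix(counts))  # variance-stabilizing transformation
  # remove the batch effect associated with sequencing date
  ComBat(dat = norm.counts, batch = meta$sequence_month_year)
}
```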

counts.batch <- normalize(assay(dds), meta.sepsis)
## Found 238 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.
# perform pca again, now on batch-corrected counts
pca_res_batch <- perform_pca(counts.batch, scale = F)
prep_pca <- prepare_pca(meta.sepsis, pca_res_batch)
multiple_pcas <- lapply(c('sample_location', 'mortality', 
                          'endotype_name', 'sepsis_severity'), 
       function(x) plot_pca(prep_pca, color = x))

grid.arrange(grobs = multiple_pcas, 
             top = "PCA on multiple variables with batch correction")

Outliers

Outliers are tough to identify just by looking at a PCA plot, since our data does not cluster all that well, especially on class variables. Therefore, we turned to sklearn’s IsolationForest. This model isolates anomalies based on how few and how different they are; it handles high-dimensional datasets well and runs quickly. Its typical use cases lie outside RNA-Seq, but it can be a good indicator of outliers. We will not rely on this model alone to remove entire samples; it is merely a way of labeling potential outliers. We will still zoom in on these candidate outliers and decide whether to remove them - RNA-Seq data can be noisy and complex, and some outliers can be biologically interesting. IsolationForest is a Python model and is therefore not included in this notebook; please find it in detect_outliers.ipynb.
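For readers who want to stay in R, the isotree package offers a comparable isolation forest. A minimal sketch - this is not the model actually used (that lives in detect_outliers.ipynb), and the parameters here are illustrative:

```r
# Illustrative R alternative to sklearn's IsolationForest (assumption: isotree)
library(isotree)

# fit on samples: rows = samples, columns = genes
iso <- isolation.forest(t(rld), ntrees = 100)

# higher scores indicate more anomalous samples
scores <- predict(iso, t(rld))
head(sort(scores, decreasing = TRUE))
```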

The samples in possible_outliers were flagged by IsolationForest. We compare their gene expression to twenty randomly chosen ‘inlier’ samples. To accomplish this, we reshape to a long format with the gene expression per sample.

# possible outliers detected by IsolationForest
possible_outliers <- c(
 'sepcol002',
 'sepcol007',
 'sepcol014',
 'sepcol061',
 'sepcol065',
 'sepwes014',
 'sepcv010T0')

inliner_samples <- rld %>% as.data.frame() %>%
  select(!all_of(possible_outliers))

# sample 20 inliners
names_samples <- sample(names(inliner_samples), 20)

sample_df <- inliner_samples %>%
  select(all_of(names_samples)) %>%
  rownames_to_column('gene')

outlier_samples <- rld %>% as.data.frame() %>%
  select(all_of(possible_outliers)) %>%
  rownames_to_column('gene')

# switch to long format and give outlier and inliner different colors
combined_long <- outlier_samples %>%
  right_join(sample_df, by = 'gene') %>%
  pivot_longer(-gene, names_to = "sample", values_to = "gene_exp") %>%
  mutate(color = ifelse(sample %in% names_samples, "red", "blue"))

In the next plot, boxplots distinguish between outlier and inlier expression patterns. The outliers detected by IsolationForest do not differ that much from the inliers. We will remove them and inspect the effect in PCA.

ggplot(combined_long, aes(x = sample, y = gene_exp, fill = color)) +
  geom_boxplot() +
  labs(title = "Boxplot of outliers and a random set of inliers", 
       x = "Sample", y = "Gene expression") +
  # blue = outlier, red = inlier (matching the mutate above)
  scale_fill_discrete(labels = c("blue" = "Outlier", "red" = "Inlier")) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  guides(fill = guide_legend(title = "Type")) 
The outliers detected by IsolationForest (in blue) compared to twenty random samples from the inlier population (in red). Only two of the outlier samples show a large distinction from the inliers.

In the next section, we removed the samples in possible_outliers and compared the results to the previous PCA plots to measure the outliers’ influence. We conclude that removing the outliers would not “solve” many clustering problems; therefore, we will not remove outliers here. We will look at outliers once more in the mitochondria-related gene section.

pca_res_inliners <- perform_pca(inliner_samples)
red_meta <- meta.sepsis %>%
  filter(!sample_identifier %in% possible_outliers)

plot_pca(pca_res_inliners, meta = red_meta, color = 'sepsis_severity')
PCA on the inlier population on `sepsis_severity` to measure the influence of the outliers detected by IsolationForest.

We use the PC loadings to determine which genes contribute the most to the first two PCs, via the function top_loadings. This function first decides which target to use: for a gene, we look at the loadings established in PCA ($rotation); otherwise, the target is a sample, and we use the PC scores (through $x). We then select the first two PCs and determine which genes or samples score highest on each, slicing five from each PC - resulting in up to 10 interesting genes or samples. Thereafter, we plot the reduced PCA dataset with fviz_pca_var from the factoextra package. The length of an arrow depicts the ‘importance’ of the gene or sample to the established variance of that PC. The direction says something about how correlated genes are with others (and whether they are negatively or positively correlated).

top_loadings <- function(pca_df, target) {
  if (target == 'gene') {
    # if target is gene, pick loadings
    pc_loadings <- as.data.frame(pca_df$pca$rotation) %>%
      rownames_to_column(target) 
  } else {
    # else go for the x, the PCs
    pc_loadings <- as.data.frame(pca_df$pca$x) %>%
      rownames_to_column(target) 
  }
  
  top_performers <- pc_loadings %>%
    select(!!sym(target), PC1, PC2) %>%
    # to long format
    pivot_longer(-!!sym(target), names_to = 'PC', values_to = 'score') %>%
    group_by(PC) %>%
    mutate(absolute_Score = abs(score)) %>%
    arrange(desc(absolute_Score)) %>%
    # slice the best-performing for each PC
    slice_max(order_by = absolute_Score, n = 5) %>%
    # only keep unique genes (can overlap between PCs)
    distinct(!!sym(target))
  
  return(top_performers)
}

top_genes <- top_loadings(pca_res_batch, 'gene')

fviz_pca_var(pca_res_batch$pca, col.var="contrib",
             gradient.cols = c("blue", "orange", "red"),
             repel = TRUE, select.var = list(name=top_genes$gene))
Loadings of the top-contributing genes in the first two principal components, contributing most to the variance.

We extract the top-contributing genes from rld (the VST-normalized counts) and depict their expression in a heatmap. Some genes have high scaled expression (>2), especially gene ENSG000001609132. We do not yet know their gene names or whether they are mitochondria-related; we have saved the result to a CSV and will check later whether these are mitochondria-related genes.

top_scores <- rld %>%
  as.data.frame() %>%
  rownames_to_column('gene') %>%
  filter(gene %in% top_genes$gene) %>%
  column_to_rownames('gene') %>%
  scale()

pheatmap(top_scores, show_colnames =  F, 
         main = "Heatmap of the top-contributing genes", scale='column')
Heatmap of top-contributing genes, five for each of the first two components.

We follow a similar process for samples, extracting those that contribute most to the first two PCs based on variance, up to ten in total. Most of these samples come from the Toronto cohort, recognizable by the col notation.

top_score <- top_loadings(pca_res, target = 'id')

# subset the scaled top-gene matrix to the top-contributing samples
# (the original indexed the first five columns, i.e., arbitrary samples)
pheatmap(top_scores[, top_score$id], show_rownames =  F, show_colnames =  F,
         main = "Heatmap of the top-contributing samples")

We now turn to hierarchical clustering, exploring whether our gene expression clusters. We use a fairly standard workflow: euclidean distance and ward.D2 linkage. After that, we cut the tree into three clusters (motivated by the three levels of sepsis_severity). The dendrogram depicted below has evenly sized clusters.

dist_matrix <- dist(pca_res_batch$pca$x[, 1:10], method = 'euclidean') 
hc <- hclust(dist_matrix, method = 'ward.D2')
cluster_labels <- cutree(hc, 3)
dend_colored <- color_branches(as.dendrogram(hc), k = 3, 
                               labels = cluster_labels)
plot(dend_colored, hang = -1, horiz = TRUE, leaflab = "none")
Hierarchical clustering on the first ten principal components to establish which samples are closely related. Due to the large number of samples, it is hard to observe relationships here; other figures that paint a clearer picture follow.

We now associate the clusters established in the previous graph with sepsis_severity by joining them on the sample identifier, and then visualize the findings through PCA. We see that the severity variable does not align with the clusters.

cluster_groups <- data.frame(PC1 = pca_res_batch$pca$x[, 1], 
                             PC2 = pca_res_batch$pca$x[, 2],
                             var = pca_res_batch$eigenvalues$pva,
                             cluster = factor(cluster_labels)) %>%
  rownames_to_column('sample_identifier') %>%
  left_join(meta.sepsis, by = 'sample_identifier')

plot_pca(cluster_groups, 'PC1', 'PC2', 
         color = 'sepsis_severity', shape = 'cluster')
The new grouping based on the results from hierarchical clustering, to see whether the clusters overlap with severity. Severity does not align with the clusters established by hierarchical clustering.

To conclude our findings, we plot the two PCs as density plots in the figures below. The clusters are fairly evenly sized for both PCs.

cluster_long <- cluster_groups %>%
  pivot_longer(cols = c(PC1, PC2))

ggplot(cluster_long, aes(x = value, fill = cluster)) +
  geom_density(alpha = 0.7) +
  facet_wrap(~ name) +
  labs(title = "Density plots of PC1 and PC2 by cluster")
Density plots of PC1 and PC2 and where the clusters are located.

Conclusion

In conclusion, we looked at correlations, distributions, and missing values in the metadata set. Due to the cohort dependency of the NAs, they were hard to impute, and according to missing-value expert Erler, filling in a zero is not the way to go either: it distorts statistics and misrepresents the true data. In addition, we removed near-zero-variance features from the metadata set. The large number of missing values in mortality, especially in the ICU cohort, can become problematic for downstream analysis.

We did not remove any samples, although there are outliers in our count data. Justifying their removal is difficult, since we do not see any cluster formation on the class variables either way. We also determined that VST normalization and batch correction were necessary; this produced better distributions for the entire transcriptome and for the mitochondria-related genes. Finally, we discovered that the sepsis_severity class variable does not cluster well. This is not a strange phenomenon in RNA-Seq and confirms the challenges posed by the heterogeneous nature of sepsis.

References

Baghela, A. et al. 2022. “Predicting Sepsis Severity at First Clinical Presentation: The Role of Endotypes and Mechanistic Signatures.” eBioMedicine 75: 103776. https://doi.org/10.1016/j.ebiom.2021.103776.
“Ensembl.” n.d. https://www.ensembl.org/index.html.
Erler, N. n.d. “Personal Website.” https://www.nerler.com/.
“GOCC_MITOCHONDRION - Gene Set.” n.d. https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/GOCC_MITOCHONDRION.